Part I - Global Salaries in AI/ML and Big Data¶

by Nicholas Seswa Anungo¶

Table of Contents¶

  • Introduction
  • Preliminary Wrangling
    • Data Gathering
    • Assessing Data
    • Data Cleaning
  • Univariate Exploration
  • Bivariate Exploration
  • Multivariate Exploration
  • Conclusions

Introduction¶

The dataset contains salary information collected anonymously from professionals all over the world in the AI/ML and Big Data space.

The dataset consists of a single table with all salary information structured as follows:

work_year The year the salary was paid.

experience_level The experience level in the job during the year with the following possible values:

EN - Entry-level / Junior

MI - Mid-level / Intermediate

SE - Senior-level / Expert

EX - Executive-level / Director

employment_type The type of employment for the role:

PT - Part-time

FT - Full-time

CT - Contract

FL - Freelance

job_title The role worked in during the year.

salary The total gross salary amount paid.

salary_currency The currency of the salary paid as an ISO 4217 currency code.

salary_in_usd The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com).

employee_residence Employee's primary country of residence during the work year as an ISO 3166 country code.

remote_ratio The overall amount of work done remotely, possible values are as follows:

0 - No remote work (less than 20%)

50 - Partially remote

100 - Fully remote (more than 80%)

company_location The country of the employer's main office or contracting branch as an ISO 3166 country code.

company_size The average number of people that worked for the company during the year:

S - less than 50 employees (small)

M - 50 to 250 employees (medium)

L - more than 250 employees (large)

Original data source: ai-jobs.net Salaries website - https://salaries.ai-jobs.net/download/

ISO 3166 Country Code - https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes
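The salary_in_usd conversion described above can be sketched as follows (a minimal illustration only; the exact yearly average FX rates from fxdata.foorilla.com are not reproduced here, so the rate below is a hypothetical example):

```python
# Sketch of the salary_in_usd derivation: local salary divided by the
# average local-currency-per-USD rate for the work year.
def to_usd(salary, fx_rate_per_usd):
    """Convert a local-currency salary to USD, rounding to whole dollars."""
    return round(salary / fx_rate_per_usd)

# Hypothetical rate purely for illustration: ~74 INR per USD
print(to_usd(7_000_000, 74.0))
```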

Preliminary Wrangling¶

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import requests as rt
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px

%matplotlib inline
In [67]:
#to make plotly interactive offline
import plotly.io as pio
pio.renderers.default='notebook'

Data Gathering¶

The dataset used in this project can be accessed programmatically using the URL 'https://salaries.ai-jobs.net/download/salaries.csv'.

In [2]:
#Using requests to download the dataset from the url
url = 'https://salaries.ai-jobs.net/download/salaries.csv'
response = rt.get(url)
In [3]:
#saving the downloaded csv file
with open('salaries.csv', mode = 'wb') as file:
    file.write(response.content)
In [4]:
#reading the downloaded file
Salaries = pd.read_csv('salaries.csv')

Assessing Data¶

In this section the dataset is inspected for quality and tidiness issues. Methods used in the assessment include both visual assessment and programmatic assessment using the pandas library.

In [5]:
#displaying the dataset
Salaries
Out[5]:
work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
0 2022 SE FT Data Engineer 195700 USD 195700 US 0 US M
1 2022 SE FT Data Engineer 130500 USD 130500 US 0 US M
2 2022 SE FT ML Engineer 130000 USD 130000 US 100 US M
3 2022 SE FT ML Engineer 84000 USD 84000 US 100 US M
4 2022 MI FT Data Operations Engineer 100000 USD 100000 US 100 US M
... ... ... ... ... ... ... ... ... ... ... ...
988 2020 SE FT Data Scientist 412000 USD 412000 US 100 US L
989 2021 MI FT Principal Data Scientist 151000 USD 151000 US 100 US L
990 2020 EN FT Data Scientist 105000 USD 105000 US 100 US S
991 2020 EN CT Business Data Analyst 100000 USD 100000 US 100 US L
992 2021 SE FT Data Science Manager 7000000 INR 94665 IN 50 IN L

993 rows × 11 columns

In [6]:
#displaying sample of 7 rows randomly
Salaries.sample(7)
Out[6]:
work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
341 2022 SE FT Data Engineer 200000 USD 200000 US 100 US M
278 2022 SE FT Data Engineer 145000 USD 145000 US 100 US M
162 2022 MI FT Data Analyst 113000 USD 113000 US 0 US L
360 2022 SE FT Data Engineer 210000 USD 210000 US 100 US M
836 2021 MI FT Data Analyst 80000 USD 80000 US 100 US L
400 2022 SE FT Data Engineer 154600 USD 154600 US 100 US L
93 2022 SE FT Data Scientist 191475 USD 191475 US 100 US M
In [7]:
#Getting the general information of the dataset
Salaries.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 993 entries, 0 to 992
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           993 non-null    int64 
 1   experience_level    993 non-null    object
 2   employment_type     993 non-null    object
 3   job_title           993 non-null    object
 4   salary              993 non-null    int64 
 5   salary_currency     993 non-null    object
 6   salary_in_usd       993 non-null    int64 
 7   employee_residence  993 non-null    object
 8   remote_ratio        993 non-null    int64 
 9   company_location    993 non-null    object
 10  company_size        993 non-null    object
dtypes: int64(4), object(7)
memory usage: 85.5+ KB
In [8]:
#Checking for the missing values
Salaries.isna().sum()
Out[8]:
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64
In [9]:
Salaries.describe()
Out[9]:
work_year salary salary_in_usd remote_ratio
count 993.000000 9.930000e+02 993.000000 993.000000
mean 2021.623364 2.685874e+05 118863.002014 67.623364
std 0.621472 1.242680e+06 68034.251369 43.319787
min 2020.000000 2.324000e+03 2324.000000 0.000000
25% 2021.000000 7.500000e+04 69047.000000 0.000000
50% 2022.000000 1.230000e+05 112987.000000 100.000000
75% 2022.000000 1.700000e+05 159000.000000 100.000000
max 2022.000000 3.040000e+07 600000.000000 100.000000
In [10]:
#Displaying the unique values in the experience_level column
Salaries['experience_level'].unique()
Out[10]:
array(['SE', 'MI', 'EN', 'EX'], dtype=object)
In [11]:
#Displaying the unique values in the employment_type column
Salaries['employment_type'].unique()
Out[11]:
array(['FT', 'CT', 'PT', 'FL'], dtype=object)
In [12]:
#Displaying the unique values in the remote_ratio column
Salaries['remote_ratio'].unique()
Out[12]:
array([  0, 100,  50], dtype=int64)
In [13]:
#Displaying the unique values in the company_size column
Salaries['company_size'].unique()
Out[13]:
array(['M', 'L', 'S'], dtype=object)

Quality Issues¶

After going through the entire dataset, I spotted seven quality issues, which are recorded below.

  1. The salary and salary_currency columns are not needed since we have the salary_in_usd column.

  2. The entries in the experience_level column are abbreviated as 'EN', 'MI', 'SE', 'EX'.

  3. The entries in the employment_type column are abbreviated as 'FT', 'PT', 'FL', 'CT'.

  4. The entries in the company_size column are abbreviated as 'S', 'M', 'L'.

  5. Erroneous datatype: remote_ratio should be "str", not "int".

  6. The values in the remote_ratio column (100, 50, 0) need to be replaced with descriptive labels.

  7. Erroneous datatype: work_year should be "str", not "int".

Tidy Issues¶

After assessing the dataset, it is clear that the data is tidy and has no structural problems.
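One quick programmatic check behind this conclusion is confirming there are no fully duplicated rows (a sketch on a toy frame; on the real data the equivalent call would be `Salaries.duplicated().sum()`):

```python
import pandas as pd

# Toy frame standing in for the Salaries dataset
demo = pd.DataFrame({
    'work_year': [2022, 2022],
    'job_title': ['Data Engineer', 'ML Engineer'],
    'salary_in_usd': [195700, 130000],
})

# duplicated() flags rows identical to an earlier row; the sum counts them
dup_count = demo.duplicated().sum()
print(dup_count)
```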

Data Cleaning¶

In this section, the above-mentioned issues are cleaned so as to achieve a high-quality, tidy dataset and avoid drawing wrong insights. Each fix follows the same process: defining the issue, coding, and then testing.

In [15]:
#Making a copy of the original dataset before cleaning
Salary_clean = Salaries.copy()

Issue 1:

The salary and salary_currency columns are not needed since we have a column of salary_in_usd.

Define:

Drop salary and salary_currency columns

Code:

In [16]:
#using drop function to drop salary and salary_currency columns
Salary_clean.drop(['salary', 'salary_currency'], axis=1, inplace=True)

Test:

In [17]:
#checking if the columns have been dropped, by displaying columns
Salary_clean.columns
Out[17]:
Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary_in_usd', 'employee_residence', 'remote_ratio',
       'company_location', 'company_size'],
      dtype='object')

Issue 2:

The entries in the experience_level column are abbreviated as 'EN', 'MI', 'SE', 'EX'.

Define:

Replace (EN, MI, SE, EX) in the experience_level column with ('Entry-level', 'Medium-level', 'Senior-level', 'Executive-level') respectively.

Code:

In [18]:
#Using replace function to replace the abbreviations in the experience_level column
Salary_clean['experience_level'].replace(['EN', 'MI', 'SE', 'EX'], ['Entry-level', 'Medium-level', 'Senior-level', \
                                                                    'Executive-level'], inplace=True)

Test:

In [19]:
#Displaying the unique entries in the experience_level column
Salary_clean['experience_level'].unique()
Out[19]:
array(['Senior-level', 'Medium-level', 'Entry-level', 'Executive-level'],
      dtype=object)

Issue 3:

The entries in the employment_type column are abbreviated as 'FT', 'PT', 'FL', 'CT'.

Define:

Replace (FT, PT, CT, FL) in the employment_type column with ('Full-time', 'Part-time', 'Contract', 'Freelance') respectively.

Code:

In [20]:
#Using replace function to replace the abbreviations in the employment_type column
Salary_clean['employment_type'].replace(['FT', 'PT', 'CT', 'FL'], ['Full-time', 'Part-time', 'Contract', 'Freelance'],\
                                       inplace=True)

Test:

In [21]:
#Displaying the unique entries in the employment_type column
Salary_clean['employment_type'].unique()
Out[21]:
array(['Full-time', 'Contract', 'Part-time', 'Freelance'], dtype=object)

Issue 4:

The entries in the company_size column are abbreviated as 'S', 'M', 'L'.

Define:

Replace (S, M, L) in the company_size column with ('Small', 'Medium', 'Large') respectively.

Code:

In [22]:
#Using replace function to replace the abbreviations in the company_size column
Salary_clean['company_size'].replace(['S', 'M', 'L'], ['Small', 'Medium', 'Large'], inplace=True)

Test:

In [23]:
#Displaying the unique entries in the company_size column
Salary_clean['company_size'].unique()
Out[23]:
array(['Medium', 'Large', 'Small'], dtype=object)

Issue 5:

Erroneous datatype: remote_ratio should be "str", not "int".

Define:

Convert assigned datatype of remote_ratio from integer to string

Code:

In [24]:
#Using pandas to convert the datatype of remote_ratio to a string
Salary_clean['remote_ratio'] = Salary_clean['remote_ratio'].astype(str)

Test:

In [25]:
#Checking the remote_ratio datatype
type(Salary_clean['remote_ratio'][2])
Out[25]:
str

Issue 6:

The values in the remote_ratio column (100, 50, 0) need to be replaced with descriptive labels.

Define:

Replace (100, 50, 0) in the remote_ratio column with ('Remote', 'Hybrid', 'On-site') respectively.

Code:

In [26]:
#Using replace function to replace the numbers in the remote_ratio column
Salary_clean['remote_ratio'].replace(['100', '50', '0'], ['Remote', 'Hybrid', 'On-site'], inplace=True)

Test:

In [27]:
#Displaying the unique entries in the remote_ratio column
Salary_clean['remote_ratio'].unique()
Out[27]:
array(['On-site', 'Remote', 'Hybrid'], dtype=object)

Issue 7:

Erroneous datatype: work_year should be "str", not "int".

Define:

Convert assigned datatype of work_year from integer to string

Code:

In [28]:
#Using pandas to convert the datatype of work_year to a string
Salary_clean['work_year'] = Salary_clean['work_year'].astype(str)

Test:

In [29]:
#Checking the work_year datatype
Salary_clean['work_year']
Out[29]:
0      2022
1      2022
2      2022
3      2022
4      2022
       ... 
988    2020
989    2021
990    2020
991    2020
992    2021
Name: work_year, Length: 993, dtype: object
In [30]:
#Saving the cleaned dataset
Salary_clean.to_csv('Salary.csv', index=False)
In [31]:
# load in the dataset into a pandas dataframe
Salary = pd.read_csv('Salary.csv')
In [32]:
# the general overview of data shape and composition
print(Salary.shape)
print(Salary.info())
print(Salary.head(10))
(993, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 993 entries, 0 to 992
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           993 non-null    int64 
 1   experience_level    993 non-null    object
 2   employment_type     993 non-null    object
 3   job_title           993 non-null    object
 4   salary_in_usd       993 non-null    int64 
 5   employee_residence  993 non-null    object
 6   remote_ratio        993 non-null    object
 7   company_location    993 non-null    object
 8   company_size        993 non-null    object
dtypes: int64(2), object(7)
memory usage: 69.9+ KB
None
   work_year experience_level employment_type                 job_title  \
0       2022     Senior-level       Full-time             Data Engineer   
1       2022     Senior-level       Full-time             Data Engineer   
2       2022     Senior-level       Full-time               ML Engineer   
3       2022     Senior-level       Full-time               ML Engineer   
4       2022     Medium-level       Full-time  Data Operations Engineer   
5       2022     Medium-level       Full-time  Data Operations Engineer   
6       2022     Medium-level       Full-time             Data Engineer   
7       2022     Medium-level       Full-time             Data Engineer   
8       2022     Senior-level       Full-time             Data Engineer   
9       2022     Senior-level       Full-time             Data Engineer   

   salary_in_usd employee_residence remote_ratio company_location company_size  
0         195700                 US      On-site               US       Medium  
1         130500                 US      On-site               US       Medium  
2         130000                 US       Remote               US       Medium  
3          84000                 US       Remote               US       Medium  
4         100000                 US       Remote               US       Medium  
5          60000                 US       Remote               US       Medium  
6          81601                 GB       Remote               GB       Medium  
7          69047                 GB       Remote               GB       Medium  
8         141300                 US      On-site               US       Medium  
9         102100                 US      On-site               US       Medium  
In [33]:
Salary['job_title'].nunique()
Out[33]:
57
In [34]:
# descriptive statistics for numeric variables
Salary.describe()
Out[34]:
work_year salary_in_usd
count 993.000000 993.000000
mean 2021.623364 118863.002014
std 0.621472 68034.251369
min 2020.000000 2324.000000
25% 2021.000000 69047.000000
50% 2022.000000 112987.000000
75% 2022.000000 159000.000000
max 2022.000000 600000.000000
In [35]:
Salary.columns
Out[35]:
Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary_in_usd', 'employee_residence', 'remote_ratio',
       'company_location', 'company_size'],
      dtype='object')

What is the structure of your dataset?¶

The dataset has a RangeIndex of 993 entries, which form the rows, and a total of 9 columns. There are 57 different job titles in the dataset, along with 8 other features (work_year, experience_level, employment_type, salary_in_usd, employee_residence, remote_ratio, company_location, company_size). Most variables are categorical in nature, i.e. experience_level, employment_type, and company_size.

What is/are the main feature(s) of interest in your dataset?¶

I'm mostly interested in figuring out which features best determine the salaries of the jobs in the dataset.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

I expect that experience_level will be the strongest determinant of a job's salary: the higher the experience level, the higher the salary. I also think the other variables (company_size, job_title, employee_residence, work_year and remote_ratio) will have effects on salary, though not as strong as the main effect of experience_level.
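This expectation can be previewed numerically with a grouped median (a sketch on toy data; on the cleaned frame the equivalent call would be `Salary.groupby('experience_level')['salary_in_usd'].median()`):

```python
import pandas as pd

# Toy data illustrating the grouped-median check (values are made up)
toy = pd.DataFrame({
    'experience_level': ['Entry-level', 'Entry-level', 'Senior-level', 'Senior-level'],
    'salary_in_usd': [60000, 70000, 140000, 160000],
})

# Median salary per experience level
medians = toy.groupby('experience_level')['salary_in_usd'].median()
print(medians)
```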

Univariate Exploration¶

Let's start by setting the theme and creating a new dataset without the outliers in the salary_in_usd column.

In [38]:
#loading the background theme to be used in the plots
sb.set_theme()
In [39]:
#Using a z-score method to detect the outliers in the salary_in_usd column: 
Salary['z_score'] = (Salary['salary_in_usd'] - Salary['salary_in_usd'].mean()) / Salary['salary_in_usd'].std()
Salary.head()
Out[39]:
work_year experience_level employment_type job_title salary_in_usd employee_residence remote_ratio company_location company_size z_score
0 2022 Senior-level Full-time Data Engineer 195700 US On-site US Medium 1.129387
1 2022 Senior-level Full-time Data Engineer 130500 US On-site US Medium 0.171046
2 2022 Senior-level Full-time ML Engineer 130000 US Remote US Medium 0.163697
3 2022 Senior-level Full-time ML Engineer 84000 US Remote US Medium -0.512433
4 2022 Medium-level Full-time Data Operations Engineer 100000 US Remote US Medium -0.277257
In [40]:
#Using the calculated z-score to mark the outliers if the z-score is above 3 or below -3.
Salary[(Salary['z_score'] < -3) | (Salary['z_score'] > 3)]
Out[40]:
work_year experience_level employment_type job_title salary_in_usd employee_residence remote_ratio company_location company_size z_score
641 2022 Executive-level Full-time Data Engineer 324000 US Remote US Medium 3.015202
695 2022 Senior-level Full-time Data Analytics Lead 405000 US Remote US Large 4.205779
700 2022 Senior-level Full-time Applied Data Scientist 380000 US Remote US Large 3.838317
754 2020 Medium-level Full-time Research Scientist 450000 US On-site US Medium 4.867210
811 2021 Medium-level Full-time Financial Data Analyst 450000 US Remote US Large 4.867210
910 2021 Executive-level Contract Principal Data Scientist 416000 US Remote US Small 4.367462
933 2020 Executive-level Full-time Director of Data Science 325000 US Remote US Large 3.029900
939 2021 Executive-level Full-time Principal Data Engineer 600000 US Remote US Large 7.071982
985 2021 Medium-level Full-time Applied Machine Learning Scientist 423000 US Hybrid US Large 4.470351
988 2020 Senior-level Full-time Data Scientist 412000 US Remote US Large 4.308668
In [41]:
#To remove these outliers in the salary_in_usd column
new_df = Salary[(Salary['z_score'] < 3) & (Salary['z_score'] > -3)]

This new dataframe gives a dataset free from outliers, keeping only rows with a z-score between -3 and 3.
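The same filter can be packaged as a small reusable helper (a NumPy sketch; the threshold of 3 matches the cut-off used above):

```python
import numpy as np

def zscore_filter(values, threshold=3.0):
    """Keep only values whose z-score magnitude is below the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) < threshold]

# Toy example: twenty typical values plus one extreme value
data = [100.0] * 20 + [1000.0]
kept = zscore_filter(data)
print(len(kept))
```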

Let's start by looking at the distribution of experience levels in the dataset.

In [43]:
# start with a standard-scaled plot
plt.figure(figsize = [10,6])
base_color = sb.color_palette()[0]

sb.countplot(data =Salary, x ='experience_level', color=base_color)
plt.title('Job Experience Levels')
plt.xlabel('Experience')
plt.ylabel('Number of Jobs');

In the bar plot of experience levels, senior-level jobs are the most common, with more than 500 entries, followed by medium-level jobs at slightly below 300. Third are entry-level jobs at slightly above 100, and the fewest recorded are executive-level jobs, at around 50.

Next, I want to look at the distribution of salaries in the dataset, to get an idea of which salary range is most common.

In [85]:
# Resize the chart, and have two plots side-by-side
plt.figure(figsize = [20, 8]) 

# histogram on left, example of too-large bin size
# 1 row, 2 cols, subplot 1
plt.subplot(1, 2, 1) 
plt.hist(data = new_df, x = 'salary_in_usd');

# histogram on right, example of too-small bin size
plt.subplot(1, 2, 2) # 1 row, 2 cols, subplot 2
bins = np.arange(0, new_df['salary_in_usd'].max()+3000, 3000)
plt.hist(data = new_df, x = 'salary_in_usd', bins = bins);

The plots above indicate that the most frequent salaries range between 100,000 and 150,000 dollars, with about 450 occurrences. Salaries above 300,000 dollars occur the least.

Now let's look at the percentage distribution of work_year in the dataset, to see which year has the highest number of jobs.

In [45]:
#Displaying the distribution of work_year
Work_year = Salary.groupby('work_year').size().reset_index().sort_values(by=0, ascending = False)
Work_year.head(10)
Out[45]:
work_year 0
2 2022 694
1 2021 224
0 2020 75
In [68]:
#Plotting pie chart of work year using plotly
fig =px.pie(Work_year, values=0, names='work_year', title='Jobs Created Per Year')
fig.show()

Most jobs recorded were for 2022, at 69.9% of the total, more than double the combined count of the previous two years. 2021 accounts for 22.6% of the total jobs recorded, and 2020 recorded the fewest jobs, at 7.6%.
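These shares can be recomputed directly from the counts in the Work_year table above (a quick sketch):

```python
# Job counts per work year, as shown in the Work_year table above
counts = {'2022': 694, '2021': 224, '2020': 75}
total = sum(counts.values())  # 993 rows in all

# Percentage share of each year, rounded to one decimal place
shares = {year: round(100 * n / total, 1) for year, n in counts.items()}
print(shares)
```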

Let's look at the proportions of the company_size

In [50]:
#Donut plot of company_size counts
plt.figure(figsize = [10,6])
sorted_counts = Salary['company_size'].value_counts()
plt.pie(sorted_counts, labels = sorted_counts.index, startangle = 90,
        counterclock = False, wedgeprops = {'width' : 0.5})
plt.axis('square')
plt.title('Company Size Proportion');

Medium-sized companies account for the highest number of jobs recorded, with more than half of the total, followed by large companies, with small companies the fewest.

Now let's have a close look at the remote_ratio percentage distribution.

In [51]:
#Displaying the total count of unique entries of remote_ratio
Work = Salary.groupby('remote_ratio').size().reset_index().sort_values(by=0,ascending = False)
Work.head(10)
Out[51]:
remote_ratio 0
2 Remote 609
1 On-site 259
0 Hybrid 125
In [69]:
#Pie chart plot of remote_ratio
fig = px.pie(Work, values=0, names='remote_ratio', title='Mode of Work')
fig.show()

In the dataset, the most preferred mode of work is remote, at 61.3%, followed by on-site jobs at 26.1%. The least preferred mode of work is hybrid, with 12.6% of the total jobs recorded.

Next, let's plot the ten most frequent job titles.

In [70]:
#using plotly to plot a bar graph of the top 10 jobs by value_count
top10_jobs = Salary['job_title'].value_counts()[:10]
fig =px.bar(y=top10_jobs.values, 
             x=top10_jobs.index, 
             color = top10_jobs.index,
             color_discrete_sequence=px.colors.sequential.PuBuGn,
             text=top10_jobs.values,
             title= 'Top 10 Jobs',
             template= 'plotly_dark')
fig.update_layout(
    xaxis_title="Job Titles",
    yaxis_title="Count",
    font = dict(size=17,family="Franklin Gothic"))
fig.show()

From the figure of job titles above, it is clear that the top job in the AI/ML and Big Data market is Data Scientist with 232 counts, followed by Data Engineer with 225, and in third place Data Analyst with a frequency of 137. The rest of the top 10 jobs include Machine Learning Engineer, Analytics Engineer, Data Architect, Data Science Manager, Research Scientist, Machine Learning Scientist and, in tenth place, AI Scientist.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

The salary_in_usd variable took on a large range of values and had some outliers, so I used the z-score method to identify and remove them, creating a new dataframe named new_df. After the transformation the data had only one peak, at around 100,000 dollars.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

Most of the variables did not present any issues, apart from salary_in_usd, which had outliers that I handled by removing them.

Bivariate Exploration¶

To start off this section, I want to look at the pairwise relationships present between features in the dataset.

In [51]:
#recording the numerical and categorical variables
numeric_vars = ['work_year', 'salary_in_usd']
categoric_vars = ['experience_level', 'employment_type', 'job_title', 'employee_residence', 'remote_ratio', 'company_location']
In [54]:
#Violinplot of the statistical distribution of salary against work year
plt.figure(figsize = [10,6])
sb.violinplot(data=new_df, y='salary_in_usd', x='work_year', inner='quartile')
plt.yticks([1e5, 2e5, 3e5, 4e5, 5e5, 6e5], ['100k', '200k', '300k', '400k', '500k', '600k'])
plt.title('Salaries Paid per Work Year')
plt.xlabel('Work Year')
plt.ylabel('Salary ($)');

From the violin plot we can observe that as the work year increases, salaries also tend to rise, as expected. In work year 2022 the median salary is above that of 2021, which in turn is greater than that of 2020.

Let's look at another relationship, between the experience level and company size variables.

In [55]:
#Plotting a bar graph to show the relationship between two variables, experience_level and company_size
plt.figure(figsize = [10,6])
sb.countplot(data=Salary, x= 'experience_level', hue='company_size')
plt.title('Relationship between Experience level and Company size') 
plt.xlabel('Experience Level')
plt.ylabel('Count');

From the relationship between experience level and company size we can draw some insights and trends. First, it is clear that most jobs are from medium-sized companies, with entry-level roles being the exception, where large companies are ahead. The second most represented company size in the job market is large, and the least is small.

Let's have a look at the relationship between company size and remote ratio.

In [56]:
#Plotting a bar graph to show the relationship between two variables, remote_ratio and company_size
plt.figure(figsize = [10,6])
sb.countplot(data=Salary, x= 'company_size', hue='remote_ratio')
plt.title('Relationship Between Company size and Remote ratio')
plt.xlabel('Company Size')
plt.ylabel('Count');

In the bars relating company size and remote ratio, the most preferred mode in all company sizes is remote, followed by hybrid in two of them (large and small), with on-site the least. There is an interesting trend in medium companies, where the second most preferred mode is on-site and the least is hybrid.

Now I want to have a look at the relationship between work year and company size.

In [57]:
#Plotting a bar graph to show the relationship between two variables, work_year and company_size
plt.figure(figsize = [10,6])
sb.countplot(data=Salary, x= 'work_year', hue='company_size')
plt.title('Relationship Between Year of Work and Company Size')
plt.xlabel('Year of Work')
plt.ylabel('Count');

In the bars above we can note that in work years 2020 and 2021 most of the jobs were from large companies; this changed in 2022, which saw the highest rise of jobs in medium companies. Each company size's performance over the three-year period is summarized below:

Medium companies - There is a gradual increase in jobs between 2020 and 2021, with the biggest jump in 2022, at more than five times the combined count of the previous years.

Large companies - Jobs roughly doubled from 2020 to 2021, with a slight drop in 2022.

Small companies - The same trend seen in large companies also holds here: a slight increase in jobs from 2020 to 2021 and a slight drop in 2022.

Let's have a look at how salary is distributed across different levels of work experience, using the dataset from which the outliers have been removed.

In [58]:
#Plotting the distribution of work experience and salary
plt.figure(figsize = [10,6])
sb.boxplot(data=new_df, x='experience_level', y='salary_in_usd')
plt.xticks(rotation=15)
plt.yticks([1e5, 2e5, 3e5, 4e5, 5e5, 6e5], ['100k', '200k', '300k', '400k', '500k', '600k'])
plt.title('Distribution of Salary Between Different Experience Levels')
plt.xlabel('Work Experience')
plt.ylabel('Salary per year');

Considering the upper and lower limits of the boxplots, we can conclude that the highest salaries are paid to those with executive-level experience, followed by senior level, medium level and lastly entry level.

Let's look at how salary is distributed across the different company sizes, using the dataset from which the outliers have been removed.

In [59]:
#Plotting facetGrid of histograms of company_size and salary_in_usd
g = sb.FacetGrid(data = new_df, col = 'company_size', col_wrap=3)
g.map(plt.hist, "salary_in_usd")
g.set_titles('{col_name}')
g.set_xlabels('Salary ($)')
g.set_ylabels('Count');

In the dataset, most of the salary is distributed in medium-sized companies, followed by large companies and lastly small companies. This indicates that most salary payments were made in medium companies, which matches the number of jobs being highest in that company size.

Let's have a look at how salary is distributed across the work years, using the dataset from which the salary outliers have been removed.

In [80]:
#Plotting facetGrid of histograms of work_year and salary_in_usd
g = sb.FacetGrid(data = new_df, col = 'work_year', col_wrap=3)
g.map(plt.hist, "salary_in_usd")
g.set_titles('{col_name}')
g.set_xlabels('Salary ($)')
g.set_ylabels('Count');

In the dataset, most of the salary is distributed in 2022, followed by 2021 and lastly 2020. This indicates that most salary payments were made in 2022, which means most of the jobs were recorded in 2022, with 2020 recording the fewest.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

I observed that as the years increase, salaries also increase, which suggests that companies increasingly require data skills. I also noted that work experience affects the salary as expected: having more experience earns a higher salary.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

I observed an interesting relationship, where medium companies also showed the highest number of jobs recorded, meaning most of the jobs are from medium companies, hence the highest total amount of salary.

Multivariate Exploration¶

The main feature I want to explore in this part of the analysis is how the salary_in_usd relates with the other three categorical variables.

Let's begin this section by looking at how experience_level and company_size relate to salary_in_usd.

In [72]:
#Plotting facetGrid of boxplots of company_size, experience_level and salary_in_usd
order = ['Small', 'Medium', 'Large']
g = sb.FacetGrid(data = new_df, col = 'experience_level', height = 5, col_wrap=2, row_order=None)
g.map(sb.boxplot, 'company_size', 'salary_in_usd', order =order)
g.set_titles('{col_name}')
plt.yticks([1e5, 2e5, 3e5, 4e5, 5e5, 6e5], ['100k', '200k', '300k', '400k', '500k', '600k'])
g.set_xlabels('Company Size')
g.set_ylabels('Salary ($)');

The boxplots above reveal that at the senior level, higher salaries were recorded in medium and large companies, with small companies the lowest. At the medium level, most of the higher salaries were again recorded in medium and large companies, with lower figures in small companies. At the entry level, higher salaries were recorded by small companies, followed closely by large companies, with medium companies the lowest. At the executive level, medium and large companies recorded the highest salaries, with small companies the lowest.

Next, let's look at how experience_level and remote_ratio relate to salary_in_usd.

In [79]:
#Plotting facetGrid of boxplots of experience_level, remote_ratio and salary_in_usd
orders = ['Remote', 'Hybrid', 'On-site']
g = sb.FacetGrid(data = new_df, col = 'experience_level', height = 5, col_wrap=2)
g.map(sb.boxplot, 'remote_ratio', 'salary_in_usd', order =orders)
g.set_titles('{col_name}')
plt.yticks([1e5, 2e5, 3e5, 4e5, 5e5, 6e5], ['100k', '200k', '300k', '400k', '500k', '600k'])
g.set_xlabels('Remote Ratio')
g.set_ylabels('Salary ($)');

In the FacetGrid of experience levels, the interaction of remote_ratio and experience level with salary_in_usd is summarized in the points below.

  • At the entry level, all three remote ratios recorded nearly the same median salary. The highest salaries were recorded for remote roles, followed by on-site, with hybrid the lowest. The same trend holds at the remaining experience levels.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

In this section I investigated salary_in_usd against experience_level by looking at the impact of three categorical features: company_size, work_year and remote_ratio. The multivariate exploration showed that these variables indeed have a positive effect on the salary recorded. Medium companies tend to have recorded more salary compared to their counterparts, and those with higher experience recorded the largest salaries. Lastly, an interesting observation is that as work_year increases, salary_in_usd also tends to increase, with the highest recorded in 2022.

Were there any interesting or surprising interactions between features?¶

It is interesting to note that jobs in medium companies have increased over time, whereas those in small companies have decreased over time.

Conclusions¶

In conclusion, it is important to note that in all exploration performed on salary_in_usd, the outliers had been removed using the z-score method, while all the other variables remained unchanged since they are categorical in nature.

When exploring the impact the other features in the dataset have on my point of focus, salary_in_usd, I found that these variables affect salary_in_usd differently. For instance, work year has a positive effect on salary, in that an increase in work year also increases the total salary_in_usd recorded. The level of experience also has a positive effect on salary_in_usd: as experience level increases, the salary recorded also tends to rise.

During the exploration I noted that company_size and remote_ratio also have some impact on salary_in_usd, but in my view this may be driven by the number of jobs involved: the analysis shows that most jobs were recorded in medium-sized companies, and remote roles also recorded the highest number of jobs compared to the other remote_ratio categories. More research and analysis need to be performed in that area.